Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Zhi Li, UID: 506333161

Display machine information for reproducibility:

sessionInfo()

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(gtsummary)
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Attaching package: 'GGally'

The following object is masked from 'package:R.utils':

    wrap

Display your machine memory.

memuse::Sys.meminfo()

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
patients_tble
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows
admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions_tble
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime          
        <dbl>    <dbl> <dttm>              <dttm>             
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00
# ℹ 431,221 more rows
# ℹ 12 more variables: deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>
transfers_tble <- read_csv("~/mimic/hosp/transfers.csv.gz")
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
transfers_tble
# A tibble: 1,890,972 × 7
   subject_id  hadm_id transfer_id eventtype careunit        intime             
        <dbl>    <dbl>       <dbl> <chr>     <chr>           <dttm>             
 1   10000032 22595853    33258284 ED        Emergency Depa… 2180-05-06 19:17:00
 2   10000032 22595853    35223874 admit     Transplant      2180-05-06 23:30:00
 3   10000032 22595853    36904543 discharge <NA>            2180-05-07 17:21:27
 4   10000032 22841357    34100253 discharge <NA>            2180-06-27 18:49:12
 5   10000032 22841357    34703856 admit     Transplant      2180-06-26 21:31:00
 6   10000032 22841357    38112554 ED        Emergency Depa… 2180-06-26 15:54:00
 7   10000032 25742920    35509340 admit     Transplant      2180-08-06 01:44:00
 8   10000032 25742920    35968195 ED        Emergency Depa… 2180-08-05 20:58:00
 9   10000032 25742920    38883756 discharge <NA>            2180-08-07 17:50:44
10   10000032 29079034    32952584 ED        Emergency Depa… 2180-07-22 16:24:00
# ℹ 1,890,962 more rows
# ℹ 1 more variable: outtime <dttm>
procedures_icd_tble <- read_csv("~/mimic/hosp/procedures_icd.csv.gz")
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
procedures_icd_tble
# A tibble: 669,186 × 6
   subject_id  hadm_id seq_num chartdate  icd_code icd_version
        <dbl>    <dbl>   <dbl> <date>     <chr>          <dbl>
 1   10000032 22595853       1 2180-05-07 5491               9
 2   10000032 22841357       1 2180-06-27 5491               9
 3   10000032 25742920       1 2180-08-06 5491               9
 4   10000068 25022803       1 2160-03-03 8938               9
 5   10000117 27988844       1 2183-09-19 0QS734Z           10
 6   10000280 25852320       1 2151-03-18 8938               9
 7   10000560 28979390       1 2189-10-16 5551               9
 8   10000635 26134563       1 2136-06-19 3734               9
 9   10000635 26134563       2 2136-06-19 3728               9
10   10000635 26134563       3 2136-06-19 3727               9
# ℹ 669,176 more rows
#diagnoses_icd_tble <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz")
#diagnoses_icd_tble
d_icd_procedures_tble <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_procedures_tble
# A tibble: 85,257 × 3
   icd_code icd_version long_title                                              
   <chr>          <dbl> <chr>                                                   
 1 0001               9 Therapeutic ultrasound of vessels of head and neck      
 2 0002               9 Therapeutic ultrasound of heart                         
 3 0003               9 Therapeutic ultrasound of peripheral vascular vessels   
 4 0009               9 Other therapeutic ultrasound                            
 5 001               10 Central Nervous System and Cranial Nerves, Bypass       
 6 0010               9 Implantation of chemotherapeutic agent                  
 7 0011               9 Infusion of drotrecogin alfa (activated)                
 8 0012               9 Administration of inhaled nitric oxide                  
 9 0013               9 Injection or infusion of nesiritide                     
10 0014               9 Injection or infusion of oxazolidinone class of antibio…
# ℹ 85,247 more rows
d_icd_diagnoses_tble <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_diagnoses_tble
# A tibble: 109,775 × 3
   icd_code icd_version long_title                           
   <chr>          <dbl> <chr>                                
 1 0010               9 Cholera due to vibrio cholerae       
 2 0011               9 Cholera due to vibrio cholerae el tor
 3 0019               9 Cholera, unspecified                 
 4 0020               9 Typhoid fever                        
 5 0021               9 Paratyphoid fever A                  
 6 0022               9 Paratyphoid fever B                  
 7 0023               9 Paratyphoid fever C                  
 8 0029               9 Paratyphoid fever, unspecified       
 9 0030               9 Salmonella gastroenteritis           
10 0031               9 Salmonella septicemia                
# ℹ 109,765 more rows

Import transfer.csv.gz as a tibble sid_adt:

sid <- 10013310

sid_adt <- read_csv("~/mimic/hosp/transfers.csv.gz") |>
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 14 × 7
   subject_id  hadm_id transfer_id eventtype
        <dbl>    <dbl>       <dbl> <chr>    
 1   10013310 21243435    31696219 discharge
 2   10013310 21243435    31736720 ED       
 3   10013310 21243435    33511674 transfer 
 4   10013310 21243435    34848129 transfer 
 5   10013310 21243435    38910974 admit    
 6   10013310 22098926    31651850 transfer 
 7   10013310 22098926    32769810 admit    
 8   10013310 22098926    33278851 transfer 
 9   10013310 22098926    34063502 ED       
10   10013310 22098926    36029206 discharge
11   10013310 27682188    30077870 transfer 
12   10013310 27682188    30444898 discharge
13   10013310 27682188    31203589 admit    
14   10013310 27682188    35160955 ED       
   careunit                                        intime             
   <chr>                                           <dttm>             
 1 <NA>                                            2153-06-05 19:58:00
 2 Emergency Department                            2153-05-26 08:56:00
 3 Medicine/Cardiology                             2153-05-26 16:19:26
 4 Medicine/Cardiology                             2153-05-26 14:42:55
 5 Medicine/Cardiology                             2153-05-26 14:18:39
 6 Neuro Intermediate                              2153-06-12 16:31:33
 7 Neuro Surgical Intensive Care Unit (Neuro SICU) 2153-06-10 11:55:42
 8 Medicine                                        2153-06-16 19:03:14
 9 Emergency Department                            2153-06-10 10:40:00
10 <NA>                                            2153-07-21 18:02:28
11 Medicine/Cardiology                             2153-05-07 20:47:19
12 <NA>                                            2153-05-13 15:36:52
13 Coronary Care Unit (CCU)                        2153-05-06 18:28:00
14 Emergency Department                            2153-05-06 10:21:00
   outtime            
   <dttm>             
 1 NA                 
 2 2153-05-26 14:18:39
 3 2153-06-05 19:58:00
 4 2153-05-26 16:19:26
 5 2153-05-26 14:42:55
 6 2153-06-16 19:03:14
 7 2153-06-12 16:31:33
 8 2153-07-21 18:02:28
 9 2153-06-10 11:55:42
10 NA                 
11 2153-05-13 15:36:52
12 NA                 
13 2153-05-07 20:47:19
14 2153-05-06 18:28:00

Store patient’s demographic information as age, sex, and race:

age <- patients_tble %>%
  filter(subject_id == sid) %>%
  pull(anchor_age)

sex <- patients_tble %>%
  filter(subject_id == sid) %>%
  pull(gender)

race <- admissions_tble %>%
  filter(subject_id == sid) %>%
  distinct(race) %>%
  pull()

Store the top 3 diagnoses as top3_diag:

top3_diag_icd <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") %>%
  filter(subject_id == sid) %>%
  slice_head(n = 3) %>%
  pull(icd_code)
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
top3_diag_icd
[1] "I222"  "I5023" "I428" 
top3_diag <- d_icd_diagnoses_tble %>%
  filter(icd_code %in% top3_diag_icd) %>%
  pull(long_title)
top3_diag
[1] "Subsequent non-ST elevation (NSTEMI) myocardial infarction"
[2] "Other cardiomyopathies"                                    
[3] "Acute on chronic systolic (congestive) heart failure"      

Subset labevents_tble to sid_lab:

sid_lab <- arrow::open_dataset("labevents_pq", format = "parquet") %>%
  filter(subject_id == sid) %>%
  select(charttime) %>%
  collect()
sid_lab
# A tibble: 2,285 × 1
   charttime          
   <dttm>             
 1 2153-05-06 03:30:00
 2 2153-05-06 03:30:00
 3 2153-05-06 03:30:00
 4 2153-05-06 03:30:00
 5 2153-05-06 03:30:00
 6 2153-05-06 03:30:00
 7 2153-05-06 03:30:00
 8 2153-05-06 03:30:00
 9 2153-05-06 03:30:00
10 2153-05-06 03:30:00
# ℹ 2,275 more rows

Subset procedures_icd_tble:

sid_procedure <- procedures_icd_tble %>%
  filter(subject_id == sid) %>%
  select(chartdate, icd_code)
sid_procedure <- left_join(sid_procedure, d_icd_procedures_tble, by = "icd_code") %>%
  select(chartdate, long_title)
sid_procedure$chartdate <- sid_procedure$chartdate %>% as_datetime()
sid_procedure
# A tibble: 9 × 2
  chartdate           long_title                                                
  <dttm>              <chr>                                                     
1 2153-05-27 00:00:00 Measurement of Cardiac Sampling and Pressure, Left Heart,…
2 2153-05-27 00:00:00 Fluoroscopy of Multiple Coronary Arteries using Low Osmol…
3 2153-05-27 00:00:00 Ultrasonography of Multiple Coronary Arteries, Intravascu…
4 2153-06-10 00:00:00 Extirpation of Matter from Intracranial Artery, Percutane…
5 2153-06-10 00:00:00 Introduction of Other Thrombolytic into Peripheral Artery…
6 2153-07-15 00:00:00 Insertion of Feeding Device into Stomach, Percutaneous Ap…
7 2153-06-11 00:00:00 Introduction of Nutritional Substance into Upper GI, Via …
8 2153-05-06 00:00:00 Dilation of Coronary Artery, One Artery with Drug-eluting…
9 2153-05-06 00:00:00 Fluoroscopy of Multiple Coronary Arteries using Other Con…

Generate the final plot:

  ggplot() +
  # ADT
  geom_segment(data = sid_adt %>%
                 filter(eventtype != "discharge"),
               aes(x = intime,
                   xend = outtime,
                   y = "ADT",
                   yend = "ADT",
                   color = careunit,
                   linewidth = str_detect(careunit, "(ICU|CCU)"))) +
  # Lab
  geom_point(data = sid_lab,
             aes(x = charttime,
                 y = "Lab"),
             shape = 3,
             size = 3) +
  # Procedure
  geom_point(data = sid_procedure,
             aes(x = chartdate,
                 y = "Procedure",
                 shape = long_title,
                 size = 3)) +
  scale_shape_manual(values = c(2:10)) +
  #guides(shape = guide_legend(title = "Procedure")) +
  guides(linewidth = "none",
         size = "none",
         color = guide_legend(title = "Care Unit"),
         shape = guide_legend(title = "Procedure", nrow = 5)) +
  labs(
    x = "Calendar Time",
    y = "",
    title = str_c("Patient ", sid, ", ", sex, ", ", age, " years old, ", race),
    subtitle = str_c(top3_diag[1],"\n", top3_diag[2],"\n", top3_diag[3])
  ) +
  theme_light() +
  theme(legend.position = "bottom",
        legend.box = "vertical") +
  scale_y_discrete(limits = c("Procedure", "Lab", "ADT"))
Warning: Using linewidth for a discrete variable is not advised.

# Free RAM usage for subsequent rendering
rm(transfers_tble, 
   procedures_icd_tble, 
   d_icd_procedures_tble, 
   d_icd_diagnoses_tble)

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer

Create a .parquet file for chartevents.csv:

# I chose arrow::open_csv_dataset() for HW2 Q3, so I have to convert it into
# a .parqeut file in this HW.
arrow::open_csv_dataset("/home/zhili/203b-hw/hw2/chartevents.csv") %>%
  arrow::write_dataset(format = "parquet", 
                       path = "/home/zhili/203b-hw/hw2/chartevents_pq")

Read in the .parquet file:

chartevents_tble <- arrow::open_dataset("chartevents_pq",
                                       format = "parquet") %>%
  filter(subject_id == sid) %>%
  filter(itemid %in% c(220045, 220180, 220179, 223761, 220210)) %>%
  collect()
chartevents_tble
# A tibble: 549 × 11
   subject_id  hadm_id  stay_id caregiver_id charttime          
        <int>    <int>    <int>        <int> <dttm>             
 1   10013310 22098926 32769810        59913 2153-06-12 12:01:00
 2   10013310 22098926 32769810        59913 2153-06-12 12:01:00
 3   10013310 22098926 32769810        59913 2153-06-12 12:50:00
 4   10013310 22098926 32769810        59913 2153-06-12 12:50:00
 5   10013310 22098926 32769810        59913 2153-06-12 13:00:00
 6   10013310 22098926 32769810        59913 2153-06-12 13:00:00
 7   10013310 22098926 32769810        59913 2153-06-12 13:00:00
 8   10013310 22098926 32769810        59913 2153-06-12 13:03:00
 9   10013310 22098926 32769810        59913 2153-06-12 13:03:00
10   10013310 22098926 32769810        59913 2153-06-12 14:00:00
# ℹ 539 more rows
# ℹ 6 more variables: storetime <dttm>, itemid <int>, value <chr>,
#   valuenum <dbl>, valueuom <chr>, warning <int>
chartevents_tble$itemid <- as.factor(chartevents_tble$itemid)

Generate the plot:

# labeling the itemid
vital_labels <- c(`220045` = "HR", 
                  `220180` = "NBPd",
                  `220179` = "NBPs", 
                  `220210` = "RR",
                  `223761` = "Temperature")

chartevents_tble$itemid <- factor(chartevents_tble$itemid, levels = names(vital_labels))
ggplot(chartevents_tble, aes(x = charttime, y = valuenum)) +
  geom_line(aes(color = itemid)) +
  geom_point(aes(color = itemid)) +
  facet_grid(itemid ~ stay_id, 
             scales = "free", 
             labeller = labeller(itemid = vital_labels)) +
  labs(
    x = "Calendar Time",
    y = "Vital",
    title = str_c("Patient ", sid, " ICU stays - Vitals")
  ) +
  theme_light() +
  guides(color = "none")

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

Answer:

icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz")
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
icustays_tble
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit last_careunit intime             
        <dbl>    <dbl>    <dbl> <chr>          <chr>         <dttm>             
 1   10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3   10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4   10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5   10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6   10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7   10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8   10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9   10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10   10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 2 more variables: outtime <dttm>, los <dbl>

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer:

There are 50,920 unique subject_ids. A subject_id can have multiple ICU stays as shown below. The patient with subject_id 18358138 has 37 ICU stays, the most among all patients.

icustays_tble %>% distinct(subject_id)
# A tibble: 50,920 × 1
   subject_id
        <dbl>
 1   10000032
 2   10000980
 3   10001217
 4   10001725
 5   10001884
 6   10002013
 7   10002155
 8   10002348
 9   10002428
10   10002430
# ℹ 50,910 more rows
icustay_count <- icustays_tble %>%
  count(subject_id) %>%
  arrange(desc(n)) %>%
  print(width = Inf)
# A tibble: 50,920 × 2
   subject_id     n
        <dbl> <int>
 1   18358138    37
 2   12468016    33
 3   17585185    33
 4   13269859    30
 5   18676703    26
 6   11281568    25
 7   15455517    25
 8   15573773    24
 9   17937834    24
10   15131736    22
# ℹ 50,910 more rows
icustay_count %>%
  ggplot(aes(x = n)) +
  geom_bar(color = "darkblue",
           fill = "lightskyblue") +
  labs(
    x = "Number of ICU stays",
    y = "Number of patients",
    title = "Number of ICU stays per patient"
  ) +
  theme_light()

icustay_count %>%
  filter(n > 5) %>%
  ggplot(aes(x = n)) +
  geom_bar(color = "darkblue",
           fill = "lightskyblue") +
  labs(
    x = "Number of ICU stays",
    y = "Number of patients",
    title = "Number of ICU stays per patient (stays >= 6)"
  ) +
  scale_x_continuous(breaks = seq(6, 37, 1)) +
  theme_light()

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

Answer:

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions_tble
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime          
        <dbl>    <dbl> <dttm>              <dttm>             
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00
# ℹ 431,221 more rows
# ℹ 12 more variables: deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Answer:

Transform the data and summarize the information:

adm_count <- admissions_tble %>%
  count(subject_id)
adm_count
# A tibble: 180,733 × 2
   subject_id     n
        <dbl> <int>
 1   10000032     4
 2   10000068     1
 3   10000084     2
 4   10000108     1
 5   10000117     2
 6   10000248     1
 7   10000280     1
 8   10000560     1
 9   10000635     1
10   10000719     1
# ℹ 180,723 more rows
adm_hour <- admissions_tble %>%
  select(admittime) %>%
  pull() %>%
  format("%H") %>%
  as_tibble()
adm_hour
# A tibble: 431,231 × 1
   value
   <chr>
 1 22   
 2 18   
 3 23   
 4 12   
 5 23   
 6 01   
 7 05   
 8 23   
 9 02   
10 18   
# ℹ 431,221 more rows
adm_minute <- admissions_tble %>%
  select(admittime) %>%
  pull() %>%
  format("%M") %>%
  as_tibble()
adm_minute
# A tibble: 431,231 × 1
   value
   <chr>
 1 23   
 2 27   
 3 44   
 4 35   
 5 16   
 6 56   
 7 11   
 8 17   
 9 05   
10 10   
# ℹ 431,221 more rows
adm_los <- admissions_tble %>%
  mutate(los = as.numeric(dischtime - admittime, units = "hours")) %>%
  select(los)
adm_los
# A tibble: 431,231 × 1
      los
    <dbl>
 1  18.9 
 2  24.4 
 3  42.1 
 4  53.3 
 5   7.17
 6 109.  
 7  10.9 
 8   9.78
 9  12.8 
10  70.3 
# ℹ 431,221 more rows

Plot the transformed data:

Number of admissions per patient

adm_count %>%
  ggplot(aes(x = n)) +
  geom_histogram(binwidth = 1,
                 color = "darkblue",
                 fill = "lightskyblue") +
  labs(
    x = "Number of admissions",
    y = "Number of patients",
    title = "Number of admissions per patient"
  ) +
  geom_vline(xintercept = 10, color = "red") +
  annotate("text", x = 30, y = -5000, label = "10 admissions", color = "red") +
  theme_light()

Admission hour

adm_hour %>%
  ggplot(aes(x = value)) +
  geom_bar(fill = "lightskyblue",
           color = "darkblue") +
  labs(
    x = "Admission hour",
    y = "Number of admissions",
    title = "Admission hour"
  ) +
  theme_light()

Admission minute

top4admmin <- adm_minute %>% 
              count(value) %>% 
              arrange(desc(n)) %>%
              slice_head(n = 4)

adm_minute %>%
  ggplot(aes(x = as.numeric(value))) +
  geom_histogram(binwidth = 1,
                 color = "darkblue",
                 fill = "lightskyblue") +
  
  labs(
    x = "Admission minute",
    y = "Number of admissions",
    title = "Admission minute"
  ) +
  annotate("text", 
           x = as.numeric(top4admmin[1,1]), 
           y = as.numeric(top4admmin[1,2]) + 1000 , 
           label = str_c(top4admmin[1,1], ": ", top4admmin[1,2]), 
           color = "red") +
  annotate("text", 
           x = as.numeric(top4admmin[2,1]), 
           y = as.numeric(top4admmin[2,2]) + 1000 , 
           label = str_c(top4admmin[2,1], ": ", top4admmin[2,2]), 
           color = "red") +
  annotate("text", 
           x = as.numeric(top4admmin[3,1]), 
           y = as.numeric(top4admmin[3,2]) + 1000 ,
           label = str_c(top4admmin[3,1], ": ", top4admmin[3,2]), 
           color = "red") +
  annotate("text", 
           x = as.numeric(top4admmin[4,1]), 
           y = as.numeric(top4admmin[4,2]) + 1000 ,
           label = str_c(top4admmin[4,1], ": ", top4admmin[4,2]), 
           color = "red") +
  theme_light()

Length of hospital stay

adm_los %>%
  ggplot(aes(x = los)) +
  geom_histogram(binwidth = 24,
                 color = "darkblue",
                 fill = "lightskyblue") +
  labs(
    x = "Length of hospital stay (hours)",
    y = "Number of admissions",
    title = "Length of hospital stay"
  ) +
  theme_light()

adm_los %>%
  filter(los < 1000) %>%
  ggplot(aes(x = los)) +
  geom_histogram(binwidth = 24,
                 color = "darkblue",
                 fill = "lightskyblue") +
  labs(
    x = "Length of hospital stay (hours)",
    y = "Number of admissions",
    title = "Length of hospital stay (< 1,000 hours)"
  ) +
  theme_light()

  • The number of admissions per patient is right-skewed, with most patients having 1 or 2 admissions. There are some outliers distributed evenly from more than 25 admissions, among which we can see a patient that has 238 admissions which is far more than the rest.
adm_count %>% arrange(desc(n))
# A tibble: 180,733 × 2
   subject_id     n
        <dbl> <int>
 1   15496609   238
 2   15464144   185
 3   10714009   163
 4   16662316   142
 5   15229574   130
 6   14394983   129
 7   11553072    98
 8   13475033    97
 9   17517983    95
10   17011846    94
# ℹ 180,723 more rows
  • The admission hour is not showing any unusual pattern. The most admissions occur at 12 a.m. at the midnight and generally there are more admissions from 14 p.m. to 12 a.m. than the rest of the day. Also we can see that there are more admissions at 7 a.m. than the nearby hours.

  • The admission minute is almost distributed evenly, except for the top 4 minutes which are 00, 15, 30, and 45, with frequency of 26,319, 22,243, 12,673, and 9,239, respectively.

  • The length of hospital stay (los) is right-skewed, with most admissions having a length of stay less than 100 hours.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

Answer:

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
patients_tble
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

Answer:

Plot gender:

patients_tble %>% ggplot() +
  geom_bar(aes(x = gender),
           color = "darkblue",
           fill = "lightskyblue") +
  labs(
    x = "Gender",
    y = "Count",
    title = "Gender count"
  ) +
  scale_x_discrete(labels = c("Female", "Male"))

  theme_light()
List of 97
 $ line                      :List of 6
  ..$ colour       : chr "black"
  ..$ linewidth    : num 0.5
  ..$ linetype     : num 1
  ..$ lineend      : chr "butt"
  ..$ arrow        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_line" "element"
 $ rect                      :List of 5
  ..$ fill         : chr "white"
  ..$ colour       : chr "black"
  ..$ linewidth    : num 0.5
  ..$ linetype     : num 1
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_rect" "element"
 $ text                      :List of 11
  ..$ family       : chr ""
  ..$ face         : chr "plain"
  ..$ colour       : chr "black"
  ..$ size         : num 11
  ..$ hjust        : num 0.5
  ..$ vjust        : num 0.5
  ..$ angle        : num 0
  ..$ lineheight   : num 0.9
  ..$ margin       : 'margin' num [1:4] 0points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ title                     : NULL
 $ aspect.ratio              : NULL
 $ axis.title                : NULL
 $ axis.title.x              :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 2.75points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.title.x.top          :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 0
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 2.75points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.title.x.bottom       : NULL
 $ axis.title.y              :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : num 90
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 2.75points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.title.y.left         : NULL
 $ axis.title.y.right        :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 0
  ..$ angle        : num -90
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.75points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text                 :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : chr "grey30"
  ..$ size         : 'rel' num 0.8
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.x               :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 1
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 2.2points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.x.top           :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : num 0
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 2.2points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.x.bottom        : NULL
 $ axis.text.y               :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 1
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.text.y.left          : NULL
 $ axis.text.y.right         :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 0
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.2points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ axis.ticks                :List of 6
  ..$ colour       : chr "grey70"
  ..$ linewidth    : 'rel' num 0.5
  ..$ linetype     : NULL
  ..$ lineend      : NULL
  ..$ arrow        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_line" "element"
 $ axis.ticks.x              : NULL
 $ axis.ticks.x.top          : NULL
 $ axis.ticks.x.bottom       : NULL
 $ axis.ticks.y              : NULL
 $ axis.ticks.y.left         : NULL
 $ axis.ticks.y.right        : NULL
 $ axis.ticks.length         : 'simpleUnit' num 2.75points
  ..- attr(*, "unit")= int 8
 $ axis.ticks.length.x       : NULL
 $ axis.ticks.length.x.top   : NULL
 $ axis.ticks.length.x.bottom: NULL
 $ axis.ticks.length.y       : NULL
 $ axis.ticks.length.y.left  : NULL
 $ axis.ticks.length.y.right : NULL
 $ axis.line                 : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ axis.line.x               : NULL
 $ axis.line.x.top           : NULL
 $ axis.line.x.bottom        : NULL
 $ axis.line.y               : NULL
 $ axis.line.y.left          : NULL
 $ axis.line.y.right         : NULL
 $ legend.background         :List of 5
  ..$ fill         : NULL
  ..$ colour       : logi NA
  ..$ linewidth    : NULL
  ..$ linetype     : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_rect" "element"
 $ legend.margin             : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
  ..- attr(*, "unit")= int 8
 $ legend.spacing            : 'simpleUnit' num 11points
  ..- attr(*, "unit")= int 8
 $ legend.spacing.x          : NULL
 $ legend.spacing.y          : NULL
 $ legend.key                :List of 5
  ..$ fill         : chr "white"
  ..$ colour       : logi NA
  ..$ linewidth    : NULL
  ..$ linetype     : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_rect" "element"
 $ legend.key.size           : 'simpleUnit' num 1.2lines
  ..- attr(*, "unit")= int 3
 $ legend.key.height         : NULL
 $ legend.key.width          : NULL
 $ legend.text               :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : 'rel' num 0.8
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ legend.text.align         : NULL
 $ legend.title              :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 0
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ legend.title.align        : NULL
 $ legend.position           : chr "right"
 $ legend.direction          : NULL
 $ legend.justification      : chr "center"
 $ legend.box                : NULL
 $ legend.box.just           : NULL
 $ legend.box.margin         : 'margin' num [1:4] 0cm 0cm 0cm 0cm
  ..- attr(*, "unit")= int 1
 $ legend.box.background     : list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ legend.box.spacing        : 'simpleUnit' num 11points
  ..- attr(*, "unit")= int 8
 $ panel.background          :List of 5
  ..$ fill         : chr "white"
  ..$ colour       : logi NA
  ..$ linewidth    : NULL
  ..$ linetype     : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_rect" "element"
 $ panel.border              :List of 5
  ..$ fill         : logi NA
  ..$ colour       : chr "grey70"
  ..$ linewidth    : 'rel' num 1
  ..$ linetype     : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_rect" "element"
 $ panel.spacing             : 'simpleUnit' num 5.5points
  ..- attr(*, "unit")= int 8
 $ panel.spacing.x           : NULL
 $ panel.spacing.y           : NULL
 $ panel.grid                :List of 6
  ..$ colour       : chr "grey87"
  ..$ linewidth    : NULL
  ..$ linetype     : NULL
  ..$ lineend      : NULL
  ..$ arrow        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_line" "element"
 $ panel.grid.major          :List of 6
  ..$ colour       : NULL
  ..$ linewidth    : 'rel' num 0.5
  ..$ linetype     : NULL
  ..$ lineend      : NULL
  ..$ arrow        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_line" "element"
 $ panel.grid.minor          :List of 6
  ..$ colour       : NULL
  ..$ linewidth    : 'rel' num 0.25
  ..$ linetype     : NULL
  ..$ lineend      : NULL
  ..$ arrow        : logi FALSE
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_line" "element"
 $ panel.grid.major.x        : NULL
 $ panel.grid.major.y        : NULL
 $ panel.grid.minor.x        : NULL
 $ panel.grid.minor.y        : NULL
 $ panel.ontop               : logi FALSE
 $ plot.background           :List of 5
  ..$ fill         : NULL
  ..$ colour       : chr "white"
  ..$ linewidth    : NULL
  ..$ linetype     : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_rect" "element"
 $ plot.title                :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : 'rel' num 1.2
  ..$ hjust        : num 0
  ..$ vjust        : num 1
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 5.5points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ plot.title.position       : chr "panel"
 $ plot.subtitle             :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : num 0
  ..$ vjust        : num 1
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 0points 0points 5.5points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ plot.caption              :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : 'rel' num 0.8
  ..$ hjust        : num 1
  ..$ vjust        : num 1
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 5.5points 0points 0points 0points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ plot.caption.position     : chr "panel"
 $ plot.tag                  :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : 'rel' num 1.2
  ..$ hjust        : num 0.5
  ..$ vjust        : num 0.5
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ plot.tag.position         : chr "topleft"
 $ plot.margin               : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
  ..- attr(*, "unit")= int 8
 $ strip.background          :List of 5
  ..$ fill         : chr "grey70"
  ..$ colour       : logi NA
  ..$ linewidth    : NULL
  ..$ linetype     : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_rect" "element"
 $ strip.background.x        : NULL
 $ strip.background.y        : NULL
 $ strip.clip                : chr "inherit"
 $ strip.placement           : chr "inside"
 $ strip.text                :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : chr "white"
  ..$ size         : 'rel' num 0.8
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : 'margin' num [1:4] 4.4points 4.4points 4.4points 4.4points
  .. ..- attr(*, "unit")= int 8
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ strip.text.x              : NULL
 $ strip.text.x.bottom       : NULL
 $ strip.text.x.top          : NULL
 $ strip.text.y              :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : num -90
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ strip.text.y.left         :List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : num 90
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi TRUE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ strip.text.y.right        : NULL
 $ strip.switch.pad.grid     : 'simpleUnit' num 2.75points
  ..- attr(*, "unit")= int 8
 $ strip.switch.pad.wrap     : 'simpleUnit' num 2.75points
  ..- attr(*, "unit")= int 8
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi TRUE
 - attr(*, "validate")= logi TRUE

Plot anchor_age:

patients_tble %>% ggplot() +
  geom_histogram(aes(x = anchor_age, fill = gender),
                 binwidth = 5,
                 color = "black") +
  labs(
    x = "Age",
    y = "Count",
    title = "Age distribution vs. Gender"
  ) +
  guides(fill = guide_legend(title = "Gender")) +
  theme_light()

  • There are slightly more female patients than male patients in the patient.csv.gz data set. The total number of female patients is slightly over 150,000, while the total number of male patients is slightly below 150,000.

  • The age distribution is right-skewed, with most patients being younger than 30 years old. The number of patients shows a decreasing trend as age increases when age is greater than 55 years old. In general, the proportion of 2 genders in each age group is similar, some times female is slightly more than male.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

Answer:

# chunk 33
labevents_tble <- arrow::open_dataset("labevents_pq", format = "parquet") %>%
  filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931) &
         subject_id %in% icustays_tble$subject_id) %>%
  collect()
labevents_tble
# A tibble: 13,473,870 × 16
   labevent_id subject_id hadm_id specimen_id itemid order_provider_id
         <int>      <int>   <int>       <int>  <int> <chr>            
 1          10   10000032      NA    52958335  50882 P28Z0X           
 2          14   10000032      NA    52958335  50902 P28Z0X           
 3          19   10000032      NA    52958335  50912 P28Z0X           
 4          33   10000032      NA    52958335  50971 P28Z0X           
 5          34   10000032      NA    52958335  50983 P28Z0X           
 6          60   10000032      NA    73913913  50931 P28Z0X           
 7          63   10000032      NA    81886159  51221 P28Z0X           
 8          75   10000032      NA    81886159  51301 P28Z0X           
 9          78   10000032      NA    31644093  51221 <NA>             
10          89   10000032      NA    31644093  51301 <NA>             
# ℹ 13,473,860 more rows
# ℹ 10 more variables: charttime <dttm>, storetime <dttm>, value <chr>,
#   valuenum <dbl>, valueuom <chr>, ref_range_lower <dbl>,
#   ref_range_upper <dbl>, flag <chr>, priority <chr>, comments <chr>
# Step 1: Join the tables on subject_id
# Optimized: combined into step 2 to save RAM

#joined_tble <- icustays_tble %>%
#  left_join(labevents_tble, by = "subject_id")
#joined_tble

# Step 2: Filter rows where lab event storetime is before ICU stay intime
pre_icu_labevents <- icustays_tble %>% # Step 1: Join the tables on subject_id
  left_join(labevents_tble, by = "subject_id") %>%
  filter(storetime < intime)
Warning in left_join(., labevents_tble, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 845 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
pre_icu_labevents
# A tibble: 14,674,449 × 23
   subject_id hadm_id.x stay_id first_careunit last_careunit intime             
        <dbl>     <dbl>   <dbl> <chr>          <chr>         <dttm>             
 1   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 3   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 4   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 5   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 6   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 7   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 8   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 9   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
10   10000032  29079034  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
# ℹ 14,674,439 more rows
# ℹ 17 more variables: outtime <dttm>, los <dbl>, labevent_id <int>,
#   hadm_id.y <int>, specimen_id <int>, itemid <int>, order_provider_id <chr>,
#   charttime <dttm>, storetime <dttm>, value <chr>, valuenum <dbl>,
#   valueuom <chr>, ref_range_lower <dbl>, ref_range_upper <dbl>, flag <chr>,
#   priority <chr>, comments <chr>
# Step 3: Pick out the storetime that is maximized but not greater than intime
last_labevents_pre_icu <- pre_icu_labevents %>%
  group_by(subject_id, stay_id, itemid) %>% 
  summarise(storetime = max(storetime)) 
`summarise()` has grouped output by 'subject_id', 'stay_id'. You can override
using the `.groups` argument.
# Get maximum storetime for each group
# Group: per itemid for per stay for per patient
last_labevents_pre_icu
# A tibble: 525,174 × 4
# Groups:   subject_id, stay_id [68,467]
   subject_id  stay_id itemid storetime          
        <dbl>    <dbl>  <int> <dttm>             
 1   10000032 39553978  50882 2180-07-23 00:59:00
 2   10000032 39553978  50902 2180-07-23 00:59:00
 3   10000032 39553978  50912 2180-07-23 00:59:00
 4   10000032 39553978  50931 2180-07-23 00:59:00
 5   10000032 39553978  50971 2180-07-23 01:14:00
 6   10000032 39553978  50983 2180-07-23 00:59:00
 7   10000032 39553978  51221 2180-07-23 00:00:00
 8   10000032 39553978  51301 2180-07-23 00:00:00
 9   10000980 39765666  50882 2189-06-27 00:56:00
10   10000980 39765666  50902 2189-06-27 00:56:00
# ℹ 525,164 more rows
# Step 4: Join the last lab events before ICU stay with the pre_icu_labevents
final_labevents_pre_icu <- last_labevents_pre_icu %>%
  left_join(pre_icu_labevents, 
            by = c("subject_id", "stay_id", "storetime", "itemid"))
final_labevents_pre_icu
# A tibble: 525,243 × 23
# Groups:   subject_id, stay_id [68,467]
   subject_id  stay_id itemid storetime           hadm_id.x first_careunit      
        <dbl>    <dbl>  <int> <dttm>                  <dbl> <chr>               
 1   10000032 39553978  50882 2180-07-23 00:59:00  29079034 Medical Intensive C…
 2   10000032 39553978  50902 2180-07-23 00:59:00  29079034 Medical Intensive C…
 3   10000032 39553978  50912 2180-07-23 00:59:00  29079034 Medical Intensive C…
 4   10000032 39553978  50931 2180-07-23 00:59:00  29079034 Medical Intensive C…
 5   10000032 39553978  50971 2180-07-23 01:14:00  29079034 Medical Intensive C…
 6   10000032 39553978  50983 2180-07-23 00:59:00  29079034 Medical Intensive C…
 7   10000032 39553978  51221 2180-07-23 00:00:00  29079034 Medical Intensive C…
 8   10000032 39553978  51301 2180-07-23 00:00:00  29079034 Medical Intensive C…
 9   10000980 39765666  50882 2189-06-27 00:56:00  26913865 Medical Intensive C…
10   10000980 39765666  50902 2189-06-27 00:56:00  26913865 Medical Intensive C…
# ℹ 525,233 more rows
# ℹ 17 more variables: last_careunit <chr>, intime <dttm>, outtime <dttm>,
#   los <dbl>, labevent_id <int>, hadm_id.y <int>, specimen_id <int>,
#   order_provider_id <chr>, charttime <dttm>, value <chr>, valuenum <dbl>,
#   valueuom <chr>, ref_range_lower <dbl>, ref_range_upper <dbl>, flag <chr>,
#   priority <chr>, comments <chr>
# Step 5: Pivot wider
wider_lab <- final_labevents_pre_icu %>%
  select(subject_id, stay_id, itemid, valuenum) %>%
  arrange(subject_id, stay_id) %>%
  pivot_wider(
    names_from = itemid,
    values_from = valuenum,
    values_fn = list(valuenum = last) # Use last to get the last measurement
  )
wider_lab
# A tibble: 68,467 × 10
# Groups:   subject_id, stay_id [68,467]
   subject_id  stay_id `50882` `50902` `50912` `50931` `50971` `50983` `51221`
        <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1   10000032 39553978      25      95     0.7     102     6.7     126    41.1
 2   10000980 39765666      21     109     2.3      89     3.9     144    27.3
 3   10001217 34592300      30     104     0.5      87     4.1     142    37.4
 4   10001217 37067082      22     108     0.6     112     4.2     142    38.1
 5   10001725 31205490      NA      98    NA        NA     4.1     139    NA  
 6   10001884 37510196      30      88     1.1     141     4.5     130    39.7
 7   10002013 39060235      24     102     0.9     288     3.5     137    34.9
 8   10002155 31090461      23      98     2.8     117     4.9     135    25.5
 9   10002155 32358465      26      85     1.4     133     5.7     120    22.4
10   10002155 33685454      24     105     1.1     138     4.6     139    39.7
# ℹ 68,457 more rows
# ℹ 1 more variable: `51301` <dbl>
# Step 6: Rename the original itemid to the corresponding lab measurement
wider_lab <- wider_lab %>%
  rename(
    creatinine = `50912`,
    potassium = `50971`,
    sodium = `50983`,
    chloride = `50902`,
    bicarbonate = `50882`,
    hematocrit = `51221`,
    wbc = `51301`,
    glucose = `50931`
  )
wider_lab
# A tibble: 68,467 × 10
# Groups:   subject_id, stay_id [68,467]
   subject_id  stay_id bicarbonate chloride creatinine glucose potassium sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
# ℹ 68,457 more rows
# ℹ 2 more variables: hematocrit <dbl>, wbc <dbl>
labevents_tble <- wider_lab %>% ungroup()
labevents_tble
# A tibble: 68,467 × 10
   subject_id  stay_id bicarbonate chloride creatinine glucose potassium sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
# ℹ 68,457 more rows
# ℹ 2 more variables: hematocrit <dbl>, wbc <dbl>
# Free RAM usage for subsequent rendering
rm(pre_icu_labevents, 
   last_labevents_pre_icu, 
   final_labevents_pre_icu, 
   wider_lab)

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Answer:

# chunk 37
chartevents_tble <- arrow::open_dataset("chartevents_pq",
                                       format = "parquet") %>%
  filter(itemid %in% c(220045, 220180, 220179, 223761, 220210) & 
         subject_id %in% icustays_tble$subject_id) %>%
  collect()
chartevents_tble
# A tibble: 22,504,119 × 11
   subject_id  hadm_id  stay_id caregiver_id charttime          
        <int>    <int>    <int>        <int> <dttm>             
 1   10001884 26184834 37510196        46969 2131-01-12 12:01:00
 2   10001884 26184834 37510196        46969 2131-01-12 12:01:00
 3   10001884 26184834 37510196        46969 2131-01-12 13:00:00
 4   10001884 26184834 37510196        46969 2131-01-12 13:00:00
 5   10001884 26184834 37510196        46969 2131-01-12 13:01:00
 6   10001884 26184834 37510196        46969 2131-01-12 13:01:00
 7   10001884 26184834 37510196        46969 2131-01-12 14:00:00
 8   10001884 26184834 37510196        46969 2131-01-12 14:00:00
 9   10001884 26184834 37510196        46969 2131-01-12 14:01:00
10   10001884 26184834 37510196        46969 2131-01-12 14:01:00
# ℹ 22,504,109 more rows
# ℹ 6 more variables: storetime <dttm>, itemid <int>, value <chr>,
#   valuenum <dbl>, valueuom <chr>, warning <int>
chartevents_tble_backup <- chartevents_tble
# Step 1: Pick out the minimum of charttime per ICU stay
first_chartevents_icu <- chartevents_tble %>%
  left_join(icustays_tble, by = c("subject_id", "stay_id")) %>%
  filter(charttime >= intime & charttime <= outtime) %>%
  group_by(subject_id, stay_id, itemid) %>% 
  summarise(charttime = min(charttime)) %>%
  ungroup()
`summarise()` has grouped output by 'subject_id', 'stay_id'. You can override
using the `.groups` argument.
first_chartevents_icu
# A tibble: 362,465 × 4
   subject_id  stay_id itemid charttime          
        <dbl>    <dbl>  <int> <dttm>             
 1   10000032 39553978 220045 2180-07-23 07:12:00
 2   10000032 39553978 220179 2180-07-23 07:11:00
 3   10000032 39553978 220180 2180-07-23 07:11:00
 4   10000032 39553978 220210 2180-07-23 07:12:00
 5   10000032 39553978 223761 2180-07-23 07:00:00
 6   10000980 39765666 220045 2189-06-27 01:56:00
 7   10000980 39765666 220179 2189-06-27 01:55:00
 8   10000980 39765666 220180 2189-06-27 01:55:00
 9   10000980 39765666 220210 2189-06-27 01:54:00
10   10000980 39765666 223761 2189-06-27 02:07:00
# ℹ 362,455 more rows
# Step 2: Join the first_chartevents_icu with the chartevents_tble
final_chartevents_icu <- first_chartevents_icu %>%
  left_join(chartevents_tble, 
            by = c("subject_id", "stay_id", "charttime", "itemid"))
final_chartevents_icu
# A tibble: 362,465 × 11
   subject_id  stay_id itemid charttime            hadm_id caregiver_id
        <dbl>    <dbl>  <int> <dttm>                 <int>        <int>
 1   10000032 39553978 220045 2180-07-23 07:12:00 29079034        88981
 2   10000032 39553978 220179 2180-07-23 07:11:00 29079034        88981
 3   10000032 39553978 220180 2180-07-23 07:11:00 29079034        88981
 4   10000032 39553978 220210 2180-07-23 07:12:00 29079034        88981
 5   10000032 39553978 223761 2180-07-23 07:00:00 29079034        88981
 6   10000980 39765666 220045 2189-06-27 01:56:00 26913865        36518
 7   10000980 39765666 220179 2189-06-27 01:55:00 26913865        36518
 8   10000980 39765666 220180 2189-06-27 01:55:00 26913865        36518
 9   10000980 39765666 220210 2189-06-27 01:54:00 26913865        36518
10   10000980 39765666 223761 2189-06-27 02:07:00 26913865        36518
# ℹ 362,455 more rows
# ℹ 5 more variables: storetime <dttm>, value <chr>, valuenum <dbl>,
#   valueuom <chr>, warning <int>
# Step 3: Pivot wider
wider_chart <- final_chartevents_icu %>%
  select(subject_id, stay_id, itemid, valuenum) %>%
  arrange(subject_id, stay_id) %>%
  pivot_wider(
    names_from = itemid,
    values_from = valuenum,
    values_fn = list(valuenum = first) # Use first to get the first measurement
  )
wider_chart
# A tibble: 73,164 × 7
   subject_id  stay_id `220045` `220179` `220180` `220210` `223761`
        <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1   10000032 39553978       91       84       48       24     98.7
 2   10000980 39765666       77      150       77       23     98  
 3   10001217 34592300       96      167       95       11     97.6
 4   10001217 37067082       86      151       90       18     98.5
 5   10001725 31205490       55       73       56       19     97.7
 6   10001884 37510196       38      180       12       10     98.1
 7   10002013 39060235       80      104       70       14     97.2
 8   10002155 31090461       94      118       51       18     96.9
 9   10002155 32358465       98      109       65       23     97.7
10   10002155 33685454       68      126       61       18     95.9
# ℹ 73,154 more rows
# Step 4: Rename the original itemid to the corresponding vital measurement
wider_chart <- wider_chart %>%
  rename(
    heart_rate = `220045`,
    non_invasive_blood_pressure_systolic = `220179`,
    non_invasive_blood_pressure_diastolic = `220180`,
    temperature_fahrenheit = `223761`,
    respiratory_rate = `220210`
  )
wider_chart
# A tibble: 73,164 × 7
   subject_id  stay_id heart_rate non_invasive_blood_pr…¹ non_invasive_blood_p…²
        <dbl>    <dbl>      <dbl>                   <dbl>                  <dbl>
 1   10000032 39553978         91                      84                     48
 2   10000980 39765666         77                     150                     77
 3   10001217 34592300         96                     167                     95
 4   10001217 37067082         86                     151                     90
 5   10001725 31205490         55                      73                     56
 6   10001884 37510196         38                     180                     12
 7   10002013 39060235         80                     104                     70
 8   10002155 31090461         94                     118                     51
 9   10002155 32358465         98                     109                     65
10   10002155 33685454         68                     126                     61
# ℹ 73,154 more rows
# ℹ abbreviated names: ¹​non_invasive_blood_pressure_systolic,
#   ²​non_invasive_blood_pressure_diastolic
# ℹ 2 more variables: respiratory_rate <dbl>, temperature_fahrenheit <dbl>
chartevents_tble <- wider_chart %>% ungroup()

# Free RAM usage for subsequent rendering
rm(first_chartevents_icu, final_chartevents_icu, wider_chart)

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer:

# Step 1: Join the icustays_tble with the patients_tble
mimic_icu_cohort <- icustays_tble %>%
  left_join(patients_tble, by = "subject_id")

# Step 2: Join the mimic_icu_cohort with the admissions_tble and 
# filter out the patients who are less than 18 years old
mimic_icu_cohort <- mimic_icu_cohort %>%
  mutate(age_intime = year(intime) - anchor_year + anchor_age)

mimic_icu_cohort <- mimic_icu_cohort %>%
  left_join(admissions_tble, by = c("subject_id", "hadm_id")) %>%
  filter(age_intime >= 18)

# Step 3: Join the mimic_icu_cohort with labevents_tble
mimic_icu_cohort <- mimic_icu_cohort %>%
  left_join(labevents_tble, by = c("subject_id", "stay_id"))

# Step 4: Join the mimic_icu_cohort with chartevents_tble
mimic_icu_cohort <- mimic_icu_cohort %>%
  left_join(chartevents_tble, by = c("subject_id", "stay_id"))
mimic_icu_cohort
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit last_careunit intime             
        <dbl>    <dbl>    <dbl> <chr>          <chr>         <dttm>             
 1   10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3   10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4   10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5   10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6   10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7   10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8   10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9   10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10   10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 35 more variables: outtime <dttm>, los <dbl>, gender <chr>,
#   anchor_age <dbl>, anchor_year <dbl>, anchor_year_group <chr>, dod <date>,
#   age_intime <dbl>, admittime <dttm>, dischtime <dttm>, deathtime <dttm>,
#   admission_type <chr>, admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>, …

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the average vital measurements within the first hour of ICU stay

  • Length of ICU stay los vs first ICU unit

Answer:

8.1 Length of ICU stay los vs. demographic variables

Age at intime

# Scatter plot of los vs. age
los_age <- mimic_icu_cohort %>%
  group_by(age_intime) %>%
  summarise(meanlos = mean(los))

meanlos_max <- los_age %>% slice_max(order_by = meanlos, n = 1)
meanlos_min <- los_age %>% slice_min(order_by = meanlos, n = 1)

los_age %>%
  ggplot(aes(x = age_intime, y = meanlos)) +
  geom_line() +
  geom_smooth(color = "orange") +
  labs(
    x = "Age at intime",
    y = "Averaged Length of ICU stay (days)",
    title = "Length of ICU stay vs. age"
  ) +
  geom_point(data = meanlos_max, 
             aes(x = age_intime, y = meanlos), 
             color = "red",
             size = 5) +
  geom_point(data = meanlos_min, 
             aes(x = age_intime, y = meanlos), 
             color = "blue",
             size = 5) +
  annotate("text", 
           x = meanlos_max$age_intime, 
           y = meanlos_max$meanlos + 0.2, 
           label = str_c("Max: ", round(meanlos_max$meanlos, 2), " days, age ",
                        meanlos_max$age_intime), 
           color = "red") +
  annotate("text",
           x = meanlos_min$age_intime - 10, 
           y = meanlos_min$meanlos - 0.2, 
           label = str_c("Min: ", round(meanlos_min$meanlos, 2), " days, age ",
                        meanlos_min$age_intime), 
           color = "blue") +
  theme_light()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  • From the line plot, we can see that, as the age_intime increases, the averaged length of ICU stay increases at first until somewhere around 70 years old, then it starts to decrease. The maximum averaged length of ICU stay is 4.61 days at age 27, and the minimum is 1.88 days at age 102. This may be due to the fact that the older the patient, the more likely they are to have more severe diseases, which may lead to an earlier clinical death that shortens the length of ICU stay.

Race

los_race <- mimic_icu_cohort %>% 
  select(race, los) %>%
  group_by(race) %>%
  summarise(mean_los = mean(los)) %>%
  arrange(desc(mean_los)) %>%
  print()
# A tibble: 33 × 2
   race                          mean_los
   <chr>                            <dbl>
 1 AMERICAN INDIAN/ALASKA NATIVE     4.46
 2 PORTUGUESE                        4.26
 3 ASIAN - KOREAN                    4.23
 4 UNABLE TO OBTAIN                  4.22
 5 UNKNOWN                           4.20
 6 ASIAN - ASIAN INDIAN              4.20
 7 HISPANIC/LATINO - DOMINICAN       3.86
 8 BLACK/AFRICAN                     3.82
 9 WHITE - OTHER EUROPEAN            3.66
10 BLACK/CARIBBEAN ISLAND            3.61
# ℹ 23 more rows
los_race %>% head(1)
# A tibble: 1 × 2
  race                          mean_los
  <chr>                            <dbl>
1 AMERICAN INDIAN/ALASKA NATIVE     4.46
los_race %>% tail(1)
# A tibble: 1 × 2
  race                      mean_los
  <chr>                        <dbl>
1 HISPANIC/LATINO - MEXICAN     2.65
  • From the numeric summary, we can see that the mean length of ICU stay is the longest for race AMERICAN INDIAN/ALASKA NATIVE which is 4.455923 and the shortest for race HISPANIC/LATINO - MEXICAN which is 2.648043.

Insurance

mimic_icu_cohort %>% 
  ggplot(aes(x = insurance, y = los)) +
  geom_boxplot(fill = "lightskyblue",
               color = "darkblue") +
  labs(
    title = "Length of ICU stay vs. insurance"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_light()

mimic_icu_cohort %>%
  group_by(insurance) %>%
  summarise(mean_los = mean(los)) %>%
  arrange(desc(mean_los)) %>%
  print()
# A tibble: 3 × 2
  insurance mean_los
  <chr>        <dbl>
1 Medicare      3.48
2 Medicaid      3.46
3 Other         3.42
  • From the result, we can see that the averaged length of ICU stay is the longest for Medicare and the shortest for Other. But the difference between each group is not very noticeable.

Marital status

mimic_icu_cohort %>% 
  ggplot(aes(x = marital_status, y = los)) +
  geom_boxplot(fill = "lightskyblue",
               color = "darkblue") +
  labs(
    title = "Length of ICU stay vs. marital status"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_light()

mimic_icu_cohort %>%
  group_by(marital_status) %>%
  summarise(mean_los = mean(los)) %>%
  arrange(desc(mean_los)) %>%
  print()
# A tibble: 5 × 2
  marital_status mean_los
  <chr>             <dbl>
1 <NA>               4.28
2 MARRIED            3.46
3 SINGLE             3.40
4 DIVORCED           3.39
5 WIDOWED            3.13
  • Again, the difference of averaged los between each group is not very noticeable for those who stated their marital_status. One thing to note is that the averaged los for group NA is the longest which is about 4.28 days.

Gender

mimic_icu_cohort %>%
  ggplot(aes(x = gender, y = los)) +
  geom_boxplot(fill = "lightskyblue",
               color = "darkblue") +
  labs(
    title = "Length of ICU stay vs. gender"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_light()

mimic_icu_cohort %>%
  group_by(gender) %>%
  summarise(mean_los = mean(los)) %>%
  arrange(desc(mean_los)) %>%
  print()
# A tibble: 2 × 2
  gender mean_los
  <chr>     <dbl>
1 M          3.54
2 F          3.34
  • From the result, we can see that the averaged los is about 3.54 days in Male group and 3.34 days in Female group. Although Male’s averaged los is slightly longer than Female’s, the difference is not very noticeable.

8.2 Length of ICU stay los vs. the last available lab measurements before ICU stay

mimic_icu_cohort %>%
  select(los, creatinine, potassium, sodium, chloride, bicarbonate, hematocrit, 
         wbc, glucose) %>%
  pivot_longer(c("creatinine", "potassium", "sodium", "chloride", "bicarbonate",
                 "hematocrit", "wbc", "glucose"),
               names_to = "measurement",
               values_to = "value") %>%
  ggplot(aes(x = value, y = los)) +
  geom_point(aes(x = value, y = los, color = measurement),
             alpha = 0.3) +
  geom_smooth() +
  facet_grid(measurement ~ ., 
             scales = "free")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 60686 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 60686 rows containing missing values (`geom_point()`).

  • From the plot we can see that there is no clear pattern between los and the most of the last available lab measurements before ICU stay.

  • However, from the plot we can glimpse that there is a subtle trend that a higher bicaarbonate level and a higher hematocrit level are associated with a longer los, while a higher sodium level and a higher wbc level are associated with a shorter los.

8.3 Length of ICU stay los vs. the average vital measurements within the first hour of ICU stay

mimic_icu_cohort %>%
  select(los, heart_rate, non_invasive_blood_pressure_systolic, 
         non_invasive_blood_pressure_diastolic, temperature_fahrenheit, 
         respiratory_rate) %>%
  pivot_longer(c("heart_rate", 
                 "non_invasive_blood_pressure_systolic", 
                 "non_invasive_blood_pressure_diastolic", 
                 "temperature_fahrenheit", 
                 "respiratory_rate"),
               names_to = "measurement",
               values_to = "value") %>%
  filter(value < quantile(value, 0.99, na.rm = TRUE) &
         value > quantile(value, 0.01, na.rm = TRUE)) %>% # Remove outliers
  ggplot(aes(x = value, y = los)) +
  geom_point(aes(x = value, y = los, color = measurement),
             alpha = 0.5) +
  geom_smooth() +
  facet_grid(measurement ~ .)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

  • From the plot we can see that there is no clear pattern between los and the vital measurements within the first hour of ICU stay.

8.4 Length of ICU stay los vs. first ICU unit

# Graphical summary
mimic_icu_cohort %>%
  group_by(first_careunit) %>%
  filter(los < quantile(los, 0.99, na.rm = TRUE)) %>% # Remove outliers
  ggplot(aes(x = first_careunit, y = los)) +
  geom_boxplot(fill = "lightskyblue",
               color = "darkblue") +
  labs(
    title = "Length of ICU stay vs. first ICU unit"
  ) +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Numeric summary
mimic_icu_cohort %>%
  group_by(first_careunit) %>%
  summarise(mean_los = mean(los)) %>%
  arrange(desc(mean_los)) %>%
  print()
# A tibble: 9 × 2
  first_careunit                                   mean_los
  <chr>                                               <dbl>
1 Neuro Surgical Intensive Care Unit (Neuro SICU)      6.30
2 Surgical Intensive Care Unit (SICU)                  3.84
3 Trauma SICU (TSICU)                                  3.83
4 Neuro Intermediate                                   3.42
5 Cardiac Vascular Intensive Care Unit (CVICU)         3.29
6 Medical Intensive Care Unit (MICU)                   3.26
7 Coronary Care Unit (CCU)                             3.19
8 Medical/Surgical Intensive Care Unit (MICU/SICU)     3.08
9 Neuro Stepdown                                       2.59
  • From the result we can see that Neuro SICU has the longest averaged los which is about 6.30 days, and Neuro Stepdown has the shortest averaged los which is about 2.59 days.

  • Neuro SICU’s averaged los is significantly longer than the rest of the units, from which we can infer that patients in Neuro SICU may have more severe diseases that require a longer ICU stay.